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Abstract 

Implicit finite difference schemes are often the preferred numerical schemes in com- 
putational fluid dynamics, requiring less stringent stability bounds than the explicit 
schemes. Each iteration in an implicit scheme, however, involves global data depen- 
dencies in the form of second and higher order recurrences. Efficient parallel imple- 
mentations of such iterative methods, therefore, are considerably more difficult and 
non-intuitive. In this paper, we consider the parallelization of the implicit schemes that 
are used for solving the Euler and the thin layer Navier-Stokes equations and that re- 
quire inversions of large linear systems in the form of block tri-diagonal and/or block 
penta-diagonal matrices. We focus our attention on three-dimensional cases and present 
schemes that minimize the total execution time. We describe partitioning and schedul- 
ing schemes for alleviating the effects of the global data dependencies. An analysis of 
the communication and the computation aspects of these methods is presented. The 
effect of the boundary conditions on the parallel schemes is also discussed. The ARC-3D 
code, developed at NASA Ames, is used as an example application. Performance of the 
proposed methods is verified on the Victor multiprocessor system which is a message 
passing architecture developed at the I6M, T. J. Watson Research Center. 
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1 Introduction 

Efficient parallel implementation of any iterative method depends on the characteristics of both the 
particular application problem to be solved and the architecture of the parallel machine on which 
it is to be solved. Using iterative methods to find the numerical solution of a partial differential 
equation with boundary conditions generally involves more than just the repeated solutions of linear 
systems of the form 

Ax — b. (1) 

The elements of A and b must also be evaluated at every iteration. Depending on the equations 
and the boundary conditions, these calculations can be a significant part of the solution cost. In 
many cases the data dependencies of the iterative method and those in evaluating the elements of 
A and b are considerably different. Therefore load balancing and communication requirements vary 
from one part of the iterative process to the next. Thus, evaluating the parallelism of an iterative 
method by considering just the cost of solving Equation 1 is not necessarily a good indicator of the 
parallelism of the corresponding iterative method for solving the partial differential equation. 

In this paper we consider some of the issues involved in the efficient parallelization of iterative 
schemes in the overall context of a computational fluid dynamics (CFD) application. In particular, 
we examine a class of algorithms known as the implicit schemes for their parallelization properties 
on message passing, MIMD multiprocessor systems. We show here that, with appropriate rear- 
rangement of computation and by using suitable partitioning strategy, one can extract high degree 
of parallelism from these schemes. The ARC-3D code, developed at NASA Ames, is used as a 
model application. Experimental results are presented from the implementations on the Victor 
multiprocessor system developed at IBM, T. J. Watson Research Center. 

The organization of the rest of the paper is as follows. In the next section, various iterative 
methods, explicit and implicit, used in the CFD applications are surveyed and their important 
properties are discussed. The three dimensional Euler and the thin-layer Navier- Stokes equations 
for modeling the compressible flow of a gas over a solid body are described in Section 3. The 
numerical methods used for solving these model equations are considered in the remainder of the 
paper. The implicit schemes that are of interest are described in Section 4. In Section 5, these 
schemes are broadly classified into three categories and the parallel properties of each class is 
analyzed. The salient points of the ARC-3D code and the implementation aspects are described in 
Sections 6, 7 and 8. Results are presented in Section 9 and finally Section 10 concludes the paper. 

2 Numerical Methods in CFD 

Whether using finite-element, finite-difference, finite-volume or spectral methods, efficient methods 
for solving the time dependent Euler or Navier-Stokes equations typically involve two basic types of 
time stepping schemes: the explicit and the implicit. The most easily parallelizable and relatively 
inexpensive are the explicit schemes which can be performed simultaneously at every grid element, 
often with highly localized spatial data dependencies. However, these explicit methods, with a 
localized spatial domain of dependence, are characterized by a time step restriction imposed by 
the Courant-Friedrichs-Lewy (CFL) condition. The implicit schemes, on the other hand, are not 
restricted by the CFL condition and are often advantageous for stiff problems which contain several 
time scales or for solving steady state problems using the time dependent equations as a device for 
the iterative solution of the steady state equations. The time step of an implicit method may be 
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restricted by the need to maintain a desired level of accuracy or to maintain numerical stability. In 
general, the reduction in the number of time steps for an implicit method must be weighed against 
the increase in the cost of the implicit calculations. On uni-processor and vector architectures, 
the implicit schemes have proven to be superior in many instances. For multiprocessor systems, in 
addition to the above mentioned factors, the data dependencies must also be taken into account. 
Because of the inherent global spatial data dependencies that generally require the solution of 
large sparse linear systems, the implicit methods are less amenable to parallelization. In this 
paper we show that by suitable rearrangement of the computation steps and by using appropriate 
communication and data partitioning schemes, it is possible to retain the efficiency of the implicit 
schemes on parallel systems as well. 

Examples of explicit time stepping schemes are the Lax Wendroff methods such as MacCor- 
mack’s predictor/corrector scheme (MacCormack, [16], [17]), the linear multistep methods such as 
leap frog and Adams-Bashforth, and the one step multistage schemes such as Runge-Kutta (Jame- 
son, Schmidt, Turkel, [12]). Note that, regardless of the explicit or implicit time stepping scheme 
used, many of the other calculations involved in typical CFD codes, such as evaluating the residuals 
or the pressure have data dependencies similar to the explicit schemes. 

Implicit time stepping schemes appear in a variety different methods. These include the Ap- 
proximate Factorization (AF) / Alternating Direction Implicit (ADI) methods (Bredif, [5]; Beam, 
Warming, [3]; Abarbanel, Dwoyer, Gottlieb, [1]) and the closely related Fractional Step methods 
(Yanenko,[29]) and the LU implicit methods (Jameson, Turkel, [13]). In recent years, various Jacobi 
and Gauss-Seidel type iterative methods have been applied to solve the flux-split equations, either 
using the Newton linearization, (Chakravarthy,[7]) or the switched evolution/relaxation (SER) 
method, (Mulder and Van Leer, [27]). Implicit methods are also used to precondition minimum 
residual or conjugate gradient algorithms, for example, the Incomplete LU (Meijerink, Van der 
Vorst, [18]) and the Strongly Implicit Procedure (Stone, [25]). Most of these methods have also 
been used as smoothers for multigrid methods or in conjunction with explicit schemes. (See [4], 
[9], [24], [26].) Finally, implicit calculations, in the form of residual averaging, are used in the 
FLO codes, (Jameson, [11]), to stabilize and accelerate the convergence of an explicit Runge-Kutta 
multigrid method. 


3 Equations 

We consider implicit finite difference schemes for the three dimensional Euler and thin-layer Navier- 
Stokes equations which model the compressible flow of a gas over a solid body. The domain in 
Cartesian coordinates, x, y and z is mapped onto a computational domain with a general curvilinear 
coordinate transformation given by 


T = t, £ = £(x,y,z,<)> v = C = C(x,y,z,t). 


The surface of the solid body is assumed to be on a plane given by ( = constant. In these coordi- 
nates, the time-dependent thin-layer Navier-Stokes equations are 


dQ dE OF dG n ,dS 

dr + d{ + a v + d( ~ Re d( 


( 2 ) 
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where Q is the vector of conserved quantities, 


Q = 


p 

pu 

pv 

pw 

e 


and p, pu, pv, pw and e are, respectively, the density, the x, y, and z Cartesian components of 
momentum and the total energy per unit volume. The vector fluxes are given by, 


P U 


pV 


pW 

puu + i x p 


puV + T) x p 


puW + Ci P 

pvU + iyP 

, F = J~ 1 

pvV + r)yP 

, G = J -1 

pvW + CyP 

pwu -i- i z p 


pwV + r) z p 


pwW + CzP 

. U{e + p) - itP . 


V(e + p) - ViP . 


_ W{e + p) - CtP . 


with the contravariant velocities, U , V , and W , defined as: 

U = it + ixU + i v v + i z w 

V = T] t + T) X U + T]yV + T] Z W 
w = (t + (xU+(vV + ( z W 


and J is the metric Jacobian. 

For a perfect gas, the pressure, p, and speed of sound, a, are given by 

p = (7 - l)(e - \p{ u2 + y2 + w 2 )) 

2 IP 
a = — 

P 


where 7 is the ratio of specific heats. 

The thin-layer viscous term on the right hand side of the equation is given by 


5 = J' 1 


0 

pmiU( + (/i/3 )m 2 ( x 
pmivt + (/i/3 )m 2 ( y 
p.miW{ + (/i/3)m2Cz 
/imim 3 + (/i/ 3 ) 7 n 2 (Crii + ( v v + ( z w) 


where 

mj = £ + <l + C 2 Z 

m 2 = + CyV( + ( Z W{ 

m 3 = (u 2 + u 2 + w 2 )/2 + Pr l {?t — l) _1 ( a2 )c- 

The parameters p, Re and Pr are the dynamic viscosity, Reynolds number and Prandtl number, 
respectively. 
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The metric terms can be obtained by chain rule from the definitions of ( , rj and (. The curvilinear 
derivatives in terms of the Cartesian derivatives are given by: 


fx = J(y v Z( ~ y(Z v ) 

Vx = J{z<yc - zat) 
Cx = J(ytz v - y v z{) 


f V = J( z fl x C - Z C X u) 
rjy = /(x { 2 ( - x c z() 

C V ~ J{ Z t X T) — z v x () 


6* = J(x v y( - x ( y v ) 

Vz = J(yt x c - y< x t ) 
Cz = J(xty v - *r,yt) 


ft — ®rfx J^rfy z t£z 

TJt - ~XrVx ~ VtVv ~ z tVz 

Ct = %tCx VrCy z rCz 


and J is given by, 


J 1 = X ( yr,Z ( + X ( y ( Zr, + XrftZt - X(l l ( Zr, - X^Zf - X<y„Zf. 

The equations presented above describe Navier-Stokes equations in three dimensions after ap- 
plying the thin-layer approximation theory. The Euler equations, for inviscid flow, are obtained by 
setting the viscous term on the right hand side of Equation 2 to zero. 


4 Implicit Methods 

A numerical solution of Equation 2 can be obtained by discretizing in both time and space. We first 
describe the time discretizations which lead to the implicit schemes. A single step time discretization 
of Equation 2 can be written as 


Q»+l _ Qn 

t n+1 - 


+ 


,dE n+1 

a( — 

-(i - «x 


ajm+i 9G ”+ 1 _idS n+ \ 
+ dj] + d( ~ Re ~d(~^ 
dE n dF n dG n ,05" 
~df + -foj~ + ~dc~ ~ Re 


Here, Q n is the state of the system at time step t n , and, for example, 


E n = E(Q n ). 


A Taylor’s series expansion of the vector flux terms and the thin layer viscous flux term can be 
used to linearize the (n + 1) terms giving 

E n+1 a E n + ^n^n+l _ gnj 

F n+1 « F n + B n (Q n+1 - Q n ) 

G n+1 ~ G n + C n (Q n+1 - Q n ) 

5 n+i ^ gn + M n(Qn+l _ Qny 


A, B , C and M, the 5x5 matrices, are the flux Jacobians, given by and f^, respectively, 

and A Q n = Q n+1 — Q n . These linearized implicit equations can be written in the “delta form” as: 


/+ “ A 4^ + ^ fln+ l c " 


Re~'^M n )\AQ 


(3) 
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. ,dE n dF n , dG n -idS n 
= -At(—+— + — -Re 


d( d-q d( 


d( 


The choice of the parameter a in the above equations determines the implicit scheme. When 
a is equal to one, this is a first-order backward Euler scheme, a = 1/2 is a second-order Crank- 
Nicolson, or trapezoidal scheme, and a = 0 is forward Euler, an explicit scheme. In the limit, as 
At approaches infinity, and if a = 1, this becomes a linearized Newton’s method. 

We write Equation 3 more simply as: 


(7 + aAtN)AQ = -AtR 


( 4 ) 


where 


and the explicit terms on 


the right hand side are given by 


dE n dF n dG n _ _ 1 dS n 

R =w + w + w- Re nr 


The space discretization is performed on a uniform grid in the generalized coordinate system. 
Using second order central space differencing of the five coupled equations with lexicographic or- 
dering of the grid points, Equation 4 results in a large (5 X 5) block banded linear system. For 
three dimensional problems, the bandwidth is proportional to the number of grid points in a plane. 
Therefore, the direct solution of the above system is impractical. 

Without changing the accuracy of the scheme, the bandwidth of the linear system can be 
reduced by factoring 7 + aAtN . The factored system can then be solved with a direct inversion 
of each of the factors. For example, in the Approximate Factorization (AF) methods, such as the 
Beam- Warming scheme, the linear operator is written as the product of three factors, one for each 
coordinate direction. Thus, 7 + aAtN is factored as 


(7 + a At^-A n )(I + aAt f-B n )(7 + aAt|-C n - aAtRe~ l |-Af n )AQ n = - AtR n . (5) 
' d£ dr] oQ 

Central finite differencing of each factor, yields a block tri-diagonal matrix, where the off-diagonal 
5x5 blocks are dense. An alternative to the three way factorization is an LU factorization. Here, 
7 + aAtN is factored so that Equation 4 becomes: 


LUAQ n = -AtR 

where L contains only backward-differenced and U contains only forward-differenced discretizations 
of the spatial derivatives. Ordering the grid points accordingly, L and U become lower and upper 
triangular matrices. The bandwidth of these matrices is as large as that of the unfactored system, 
but now the inversion can be accomplished in just two sequential sweeps over the grid points. 

Iterative methods can also be used to solve the unfactored system. Various relaxation or splitting 
methods can be obtained by approximating data at off-diagonals by old data (at time t ). Jacobi, 
Gauss-Seidel and SSOR methods can be used, in their many three dimensional forms. For example, 
the simplest is a point Jacobi method, where values at each point are updated using old values at all 
surrounding grid points. A more elaborate method is a red/black(checkerboard)-line Gauss-Seidel, 
obtained by performing tridiagonal solves on all ‘red’ lines in a plane, using the old data at the 
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‘black’ points, followed by tridiagonal solves on all black lines, using the new values at the red 
points. This could be done first in the £ — 77 plane, then in the 77 — ( plane, and then in the ( - £ 
plane. 

No matter which method, direct or iterative, it is the data dependencies which determine the 
available parallelism. In the next section, all of these methods are grouped into three distinct data 
dependence categories. 

5 Parallelization of Implicit Methods 

The schemes described above, although all are implicit, exhibit different degrees of parallelism. 
In general, the extent to which any of these methods is parallelizable can often be predicted by 
examining the nature of the inter-grid data dependencies in going from one time step to the next. 
We find it convenient to classify the methods into three categories based on the dependencies: the 
parallel-by-point schemes, the parallel-by-line schemes, and the parallel-by-plane schemes. These 
categories in effect characterize the granularity of the available parallelism and, quantitatively, the 
degree of the extractable parallelism. As a consequence, the implementation techniques for all the 
schemes within a category and their expected performance are similar. 


Parallel-by-point 

Multistage methods 
(e.g., Runge-Kutta) 
Lax-Wendroff methods 
(e.g., MacCormack) 
Multistep methods 
(e.g., leap frog) 
Point Jacobi 


Parallel-by-line 

ADI/AF methods 

(e.g., Beam- Warming) 
Fractional step methods 

Red/black-line Gauss-Seidel 

Line Jacobi 


P arallel- by- plan e 
Implicit LU 
ILU and SIP 
Point/line Gauss-Seidel 
SSOR 


Figure 1: A classification of implicit methods based on available parallelism. 


In the following, we describe the main properties of the three categories using a three-dimensional 
7i x n x n grid as an example. Note that the properties are unaffected even if the grid is non- 
uniform and even if it has unequal number of grid points in each of the three dimensions. 

The ‘parallel-by-point’ schemes are the simplest. To advance to the next time step, computa- 
tions can be done at each grid point independently of the computations at the other grid points. 
Depending on the discretization stencil used, values from the previous time step at one or more 
neighboring grid points are needed in the computations at any grid point. The time-explicit schemes 
are all examples of ‘parallel-by-point’ schemes. A few of the time-implicit schemes also fit in this 
category, for example, point Jacobi on the unfactored equations. The ‘parallel-by-point ’ schemes 
are extensively analyzed in the literature for partitioning and for minimizing the communication 
overhead. (See, for example, Reed et al., [23].) On an n X n X n grid, up to n 3 proces- 
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sors can be used without introducing any computational sequentiality. Any processor assignment 
scheme which allocates approximately the same number of points to each processor gives a good 
first approximation to balancing the computation load. On distributed memory architectures, a 
static three dimensional block partitioning of the grid is usually sufficient to minimize the com- 
munication overhead, provided the interprocessor communication network provides at least a three 
dimensional connectivity. In practice, two dimensional partitioning schemes also suffice in reducing 
the communication costs close to a minimum. Improved load balancing of the computation and 
minimization of the communication overhead is usually achieved by an analysis of the effect of the 
data dependencies at the boundaries. 








HI 


Eg 


i 


parallel-by-plane 


Figure 2: Types of parallelism. 


‘Parallel-by-line’ schemes are a little more difficult. To advance to the next time step, the 
computation at each grid point is coupled to other grid points in a fixed coordinate direction, but 
can be done independently of the computations at grid points on all other lines in that coordinate 
direction. In other words, in the ‘parallel-by-line’ schemes the dependencies are such that all the 
grid points on a line must be updated together. The computations on different lines, however, may 
proceed without any dependence or communication delays. The quasi-one-dimensional algorithms 
such as ADI, AF and fractional step algorithms are in this category, as well as some of the line 
relaxation methods listed in Figure 1. For an n X n X n grid, there are n 2 grid lines in 
each of the three dimensions and hence, as long as there are n 2 or less processors, the optimal 
sequential algorithms can usually be modified to give a good parallel algorithm. With more than 
n 2 processors, the solution process must usually be ‘blocked out’. That is, each processor performs 
as many independent (parallel) computations on local data as possible before combining information 
with other processors. Having reduced the number of coupled equations, the remaining equations 
are solved by the same method by a subset of the processors. If more than one line is to be solved 
simultaneously, the rest of the processor continue the reduction on other lines. A typical example 
of this is when performing tridiagonal line solves. The sequentially optimal Thomas algorithm can 
be replaced by a substructuring/cyclic reduction algorithm to extract more parallelism. Note that 


7 



the cyclic reduction algorithms require more than twice as many floating point operations per grid 
point as the Thomas algorithm. Moreover, for these algorithms the dependencies fan out requiring 
richer interprocessor connectivity to reduce the communication overhead. The cyclic reduction 
type algorithms, however, have the advantage that more of these operations can be done in parallel 
(Heller, [8]; Ho and Johnsson, [10]). 

The third category of ‘parallel-by-plane* involves dependencies that span two or more dimen- 
sions and are more difficult to handle than the other two cases. LU factorization and Gauss-Seidel 
methods are examples of these methods. Extracting parallelism from LU factorization and Gauss- 
Seidel methods is possible, but is much more difficult, see (Naik,[19]). For a three dimensional 
problem, these methods are vectorizable along j -f k + / = constant planes, but the vector length 
varies. Similarly, the parallelization could be done in planes, at the cost of inefficient processor uti- 
lization near the corners of the computational domain. Because of the nonlinearity of the equations, 
the triangular LU factors are different at every point and at every time step. The parallelization 
methods of Anderson and Saad [2], which involve preprocessing overhead which must be amortized 
over many iterations to be worthwhile, are not useful in this context. Because of their improved 
convergence properties and increased stability, these methods warrant further investigation. 

For the remainder of this paper, we consider the parallelization of a ‘parallel-by-line’ scheme: 
the Beam- Warming Approximate Factorization scheme. The AF methods have been extensively 
studied. They can be made to be total variation diminishing (TVD), (Chakravarthy et al.,[6]) 
and the boundary conditions are well understood, (LeVeque,[15]). The ARC-3D code developed at 
NASA Ames is based on this scheme. In the following , we first briefly described the key aspects 
of ARC-3D, referring the reader for details to Pulliam and Steger, [22], and then discuss possible 
parallel implementations. In Section 9 we describe our implementation on the Victor multiprocessor 
system at IBM, Yorktown Heights. 

6 ARC-3D 

A widely distributed, general purpose implicit finite difference code is the ARC-3D code developed 
at NASA Ames, (Pulliam and Steger, [21]). It is based on the Beam- Warming Approximate Fac- 
torization scheme, described in Section 4. The numerical solution of the factored equation involves 
inverting the three factors shown in Equation 5. Since each of the three factors represents a recur- 
sive coupling of information along a particular spatial direction, the inversion of a factor has global 
data dependencies, usually in the form of block tridiagonal matrices, in that direction. 

Although the computation costs associated with Approximate Factorization scheme are consid- 
erably less than direct solves, they are still expensive. A significant reduction (40%) in computation 
can be achieved using the diagonalization of the blocks in the implicit operator, as developed by 
Pulliam and Chaussee [20]. The flux Jacobians, A n , B n and C n in Equation 5 can be diagonalized 
as: 

A* = T^A n T 0 A,, = T- l B n T„- t A c = T^C n T c . 

For steady state calculations, a modified factored algorithm for the Euler equations, which is now 
at most first order accurate in time, is given by 

T ( (I + Atd(A()N (I + AtdnAjPil + Atd c A ( )T^ 1 AQ n = R n 

with N = and P = For the explicit form of the A’s, T's, N and P, see (Pulliam and 

Steger, [22]). A fourth order finite difference discretization of the implicit factors yields scalar penta- 
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diagonal and (5x5) block diagonal factors. Thus, the numerical solution using the diagonalization 
technique consists of inverting seven factors, four of which are in (5 X 5) block diagonal form and 
the other three, which correspond to the three spatial directions, are in scalar penta-diagonal form. 
As in the block tri-diagonal case, the inversion of the scalar penta-diagonal matrices has global 
dependencies in a particular spatial direction, but now the five variables at each grid point are 
decoupled. 

We consider here, two versions of ARC-3D code for examining the implementation and perfor- 
mance issues in parallelizing some of the implicit schemes described and characterized in Sections 4 
and 5. The first version corresponds to the standard way of computing the approximate factors 
and is based on the solution of the block tri-diagonal matrices. We refer to this version of ARC-3D 
as the block tri-diagonal version. The other version considered here is based on the diagonalization 
technique described above. This version of ARC-3D is referred to as the scalar penta-diagonal ver- 
sion. In the following, we describe the computation steps and the corresponding data dependencies 
in the two versions. 


• 

penta-diagonal 

block tridiagonal 

+ 

X 

-f 


+ 

X 

-T 


RHS 

290 

356 

37 

3 

290 

356 

37 

3 

IMPLICIT 









- setup 

235 

410 

26 

6 

389 

512 

18 

0 

- forward solves 

66 

99 

6 

0 

495 

510 

117 

0 

- back solves 

i 

30 

30 

0 

0 

75 

75 

0 

0 

TOTAL 

331 

539 

32 

6 

959 

1097 

135 

0 


Table 1: Floating point operations per grid point, non-viscous case. 


The implementation of the Beam-Warming scheme in generalized coordinates involves a se- 
quence of separate tasks. These tasks include an initialization part where the computations corre- 
sponding to initial setup are made. This is followed by computations over each time step. Each 
time step requires the solution of Equation 5 to find the correction, A Q n * The computation at 
each time step is accomplished numerically by solving either a (5 X 5) block tridiagonal system 
or by solving a scalar penta-diagonal system for each line, in each of the three spatial directions. 
Thus, for an n x n X n grid, there are n 2 such systems to be solved for each direction. At the com- 
putational level, however, four separate tasks are necessary for advancing a time step. First, the 
discrete boundary conditions must be calculated at all boundaries. The inflow, outflow and solid 
body boundary conditions must be set, and any symmetry or singular conditions related to the 
computational domain must be enforced. Then the fluxes, numerical viscosity and the thin layer 
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viscous terms on the right hand side of Equation 5 are evaluated, using the appropriate difference 
formulas. Next, in the block tri-diagonal version, each of the three 5x5 coefficient matrices must 
be evaluated at every grid point in the domain and then and only then can each factor be inverted. 
For the scalar penta- diagonal version, each of the five scalar coefficients of the penta-diagonal sys- 
tem must be evaluated at every grid point for each variable before the penta-diagonal factor can 
be inverted. Finally, the flow variables, Q , must be updated: 

Q n+1 = Q n + A Q n . 

We refer to these four tasks as BC, RHS, IMPLICIT and UPDATE, respectively. 

In general, a typical application involves thousands of time steps. This amortizes the setup cost 
over the many iterations and hence the initialization cost is insignificant. The work involved in 
UPDATE is no more than five independent floating point additions at each grid point and hence 
is also negligible. So we restrict out attention to the three major tasks, BC, RHS and IMPLICIT. 

The data dependencies of each of these three operations are quite different. In BC, only the 
subdomain consisting of boundary points are updated using values from only those interior points 
which are close to the boundary points. The pressure on the solid body is determined via the normal 
momentum equation, and because central differences of tangential derivatives of the pressure are 
used, scalar tridiagonal inversions must be performed along the solid body. RHS is explicit and is 
therefore completely data parallel, but some of the calculations must be done at all grid points with 
the final computation at only the interior points. Finally, IMPLICIT involves either the inversion 
of (5 X 5) block tri-diagonal systems or scalar penta-diagonal systems in each of the three coordinate 
directions. 

The number of floating point operations per grid point for RHS and IMPLICIT in the non- 
viscous case are shown in Table 1 for the two versions of ARC-3D. The number of floating point 
operations per grid point in BC are relatively small compared to those in RHS and IMPLICIT and 
for that reason they are not shown. Clearly, for both versions of ARC-3D, the cost of IMPLICIT 
is the dominant cost. 

7 Algorithms 

To bring out the salient points for parallel implementations, we now describe the three main tasks 
of ARC-3D at the algorithmic level. First, the algorithmic aspects of IMPLICIT are presented in 
some detail and then those for RHS and BC are briefly described. The task of IMPLICIT consists 
of solving linear systems by inverting either block tri-diagonal or scalar penta-diagonal matrices. 
In the block tri-diagonal case, for each direction as many systems need to be solved as there are 
lines of grid points in that direction. In the scalar penta-diagonal version, for each variable and for 
each direction, the number of systems solved is equal to the number of lines of grid points in that 
direction. In addition to inverting scalar penta-diagonal matrices, this version requires inversion of 
block diagonal matrices. However, the inverse of each block is known analytically, and the block 
diagonal inversions can be done without requiring information from any other grid point. The 
algorithms used for solving block tri-diagonals and scalar penta-diagonals are similar to those used 
for inverting scalar tri-diagonals, which we describe next. 

Sequentially, the most efficient algorithm for inverting a tri-diagonal system is the Thomas 
algorithm which is a special case of the Gaussian elimination. To solve a system of size n, with 
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coefficients A , B and C and right hand side /, given by, 

AjUj-i + Bj u j + Cj u j+i = fj 

where u 0 and u n +i are known, the Thomas algorithm performs two sweeps over the unknowns ui 
through u n . In the forward sweep, coefficients Pj+\ and qj+ 1 , j = 1, * • • , n, are computed, where 

Pj+i = (Bj — AjPj) Cj 

*7j+ 1 = {Bj — AjPj) (fj — Ajqj). 

The unknowns uj y j = n, • • • , 1, are determined in the back sweep from 

u j = <?j+l — Pj+lUj+i- 

Note that the computation requires storage of all and gj + i from the forward sweep for com- 
puting the unknowns Uj in the back sweep. Clearly, both phases of the Thomas algorithm have 

recursive data dependencies and hence the algorithm, when used for solving a single system, is 
sequential. However, when several independent such systems need to be solved as in the case of 
IMPLICIT, the computations of these systems can be pipelined. Thus, the above algorithm can 
be parallelized by assigning to each processor parts of computations from several independent tri- 
diagonal system solves. In such a scheme, if a processor is to compute unknowns Uj p through Uk p 
from one of the systems, then it receives the coefficients Pi, and qj p from a preceding processor 
and then proceeds to compute the coefficients through Pk p and q^ p and sends the final coefficients 
Pk p and qk p to the succeeding processor. The same is repeated for all other systems assigned to 
it. After completing all the forward sweeps, the back sweeps are performed. First the values of 
Uk p +i are received for each system from the succeeding processor and back sweep is completed by 
computing the unknowns Uk p through uj p . The computed value of Uj p is sent to the preceding 
processor and the process is repeated for all the systems assigned to that processor. This we refer 
to as the pipelined version of the Thomas algorithm. Note that all the coefficients from all the 
forward sweeps must be stored until the back sweeps are completed. 

The tri-diagonals can also be computed using the substructuring or the cyclic reduction algo- 
rithms (See, e.g., Ho and Johnsson [10]). However, these schemes are computationally expensive. 
The cost of Thomas algorithm for scalar tri-diagonals is approximately 8 floating point operations, 
whereas the cyclic reduction type algorithms require about 17 floating point operations per vari- 
able. The Thomas algorithm is a sequential algorithm and if only one tri-diagonal system is to 
be solved, then the available of parallelism is negligible. The cyclic reduction type algorithms, on 
the other hand, are highly parallelizable. In the next section we describe partitioning schemes for 
extracting parallelism from the Thomas algorithm in the context of ARC-3D and show that the 
available parallelism is sufficient for the machine model assumed here. 

For the block tri-diagonal case, each of the coefficients A and C is a dense 5x5 block, matrix 
B is a 5 X 5 diagonal block, and Uj and fj are vectors of length 5. The computed coefficients Pj 
and qj are, respectively, dense 5x5 matrix and 5x1 vector. Thus, in addition to the cost of 
computing the coefficients, the corresponding Thomas algorithm involves factoring a 5 X 5 matrix 
and computing six matrix-vector products at each grid point. The scalar penta-diagonal case has 
five scalar coefficients instead of three for each variable at each grid point. The corresponding 
Thomas algorithm solves penta-diagonal systems in a similar fashion as the tri-diagonal system. 
Note that the dependencies are in the form of the second order recurrences. For each variable, the 
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forward sweep computes three coefficients and the computation of Uj in the back sweep involves 
values at Uj+\ and 2 * 

In RHS, the fourth order spatial discretization of the right hand sides leads to 13 point stencils at 
the interior points, modified to smaller stencils near the boundary. These calculations are performed 
at each grid point and they involve explicit data dependencies, i.e., they use known information. 
Therefore, the computations for RHS may be completed independently. The implementation is 
similar to Jacobi relaxation and the computations may be performed in any order. 

In BC, the inflow/outflow boundary points can also be updated independently, but the values 
at boundary points along the solid body are coupled and must be updated using tridiagonal solves 
in both the £ and 77 directions. We use the scalar Thomas algorithm for these tridiagonal solves. 

8 Parallel Implementation 

Efficient parallel implementation requires extraction of maximum parallelism with negligible com- 
munication overhead. This entails partitioning of the data across processors and rearranging the 
computations so that the load is evenly balanced and the communication overhead is kept to a 
minimum. In practice achieving both the goals simultaneously is difficult. Furthermore, since the 
data dependencies of BC, RHS, and IMPLICIT are not the same, reaching this goal is even harder. 
We describe here a class of partitioning schemes that are suitable for MIMD architectures where the 
number of processors is small compared to the number of grid points. For the following discussion 
assume that p processors are available and that the computational domain is an / X m X n grid. 



1-D partitioning 


2-D partitioning 


3-D partitioning 


Figure 3: Uni-partition schemes. 


8.1 Uni-partition Schemes 

We refer to a partitioning scheme as a uni-partition scheme when the domain of computation is 
subdivided into p partitions and one partition is assigned to each processor. There are several 
ways of partitioning the domain in this manner. The simplest and the most commonly used are 
those where the domain is partitioned evenly, or almost evenly, along one, two, and three spatial 
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dimensions. If the computational domain has at least p grid points in one of the spatial directions 
then partitioning the grid into p slices along that direction gives the 1-D uni-partition scheme. If 
the number of grid points on a plane is at least equal to p, then the 2-D uni-partitioning scheme 
is possible. Here the grid is partitioned along two spatial directions. Similarly, in the 3-D uni- 
partitioning scheme, the domain is divided along all the three dimensions. See Figure 3. 

The 3-D partitioning scheme has the smallest surface to volume ratio whereas the 1-D scheme 
has the largest. Thus, for RHS, where the data dependencies are local, the 3-D partitioning scheme 
reduces the volume of data communicated. However, in this scheme, some processors may have 
to communicate with at least six other processors and possibly with up to 12 other processors. 
This scheme is also unattractive if a three dimensional mesh cannot be mapped efficiently on the 
underlying inter-processor communication network. Furthermore, the use of 13 point stencil in 
RHS makes it necessary to maintain a buffer containing two rows or two columns worth of extra 
information from the neighboring partitions at each surface of the partition that is adjacent to 
another partition. This memory overhead increases as the number of grid points per processor 
decreases. In the extreme case where one grid point is assigned to a processor, the information at 
the surrounding 12 grid points may have to be buffered. 


Sequential Delay 

Memory Required 

0{n) 

0(n 3 /p) 

0(n 2 ) 

0(n 2 /p) 

0(n 3 ) 

0(n/p) 


Figure 4: The pipelining delay and memory requirement tradeoff. 


For the IMPLICIT part, the 1-D partitioning scheme requires inter-processor communication 
in only one of the three directions. Thus, for the 1-D partitioning scheme, the adverse effect of the 
data dependencies is restricted along only one direction. The effect of data dependencies may be 
further reduced by pipelining the forward and back sweeps of IMPLICIT, as observed in (Johnson, 
Saad and Schultz, [14]). The larger the number of lines that can be pipelined, the lesser the effect of 
sequential data dependencies present in the Thomas algorithm. If there is no pipelining, then each 
of the line solves is done sequentially, irrespective of the number of processors used. However, if all 
the line solves are pipelined, then the net effect is that the computation may be completed in time 
equal to one line solve performed sequentially plus the time required to compute the remaining line 
solves with perfect parallelism. This cost can be further reduced by starting the computations from 
both sides simultaneously. This reduces the sequential cost of filling the pipe by half. In the 2-D 
and 3-D partitioning schemes, the data dependencies of IMPLICIT are spread across processors 
along two and three dimensions, respectively. This gives rise to data dependency delays in two and 
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three directions. Thus, if no pipelining of computations is performed, then the 3-D partitioning 
scheme is the most adversely affected and has very little parallelism. With full pipelining, the 2-D 
and 3-D cases have a delay of two and three sequential line solves, respectively. Note from Table 1, 
that for the type of problems considered here, the cost of one sequential line solve can be very high 
and may quickly dominate the total parallel cost. This is particularly true when the number of grid 
points assigned to each processor is small. By assigning a sufficiently large number of grid points 
to each processor, one may reduce the effect of the sequential part in the pipelined computation. 
However, there is a penalty to be paid in terms of memory. In order to be able to pipeline the 
computations, it is necessary to store the intermediate results of all the forward solves that are 
pipelined. This entails a memory overhead of storing an extra 5x5 matrix at each grid point for 
the block tri-diagonal case. In the pent a- diagonal case the memory overhead is eleven coefficients 
per grid point along each direction. This gives rise to an interesting tradeoff between the memory 
requirement and the benefit of pipelining the IMPLICIT computations. See Figure 4. 

Under the uni-partitioning scheme, the substructuring or the cyclic reduction type algorithms 
may also be used in IMPLICIT. These algorithms have higher degree of available parallelism, but 
there is a penalty of almost doubling the computation cost. Thus, the 1-D and 2-D partitioning 
schemes with substructuring generally fair poorly as compared to the Thomas based IMPLICIT. 
For 3-D partitioning schemes, the communications is global in all the three dimensions, making 
these schemes communication intensive. 

In Table 2, some of the issues described above are quantified for computing scalar tri-diagonals 
along all the three dimensions of an n x n x n grid and using p processors, neglecting lower order 
terms. These complexities assume p < n, though the complexities for the 2-D Pipelined and 3-D 
Diagonal scheme remain valid for p < n 2 and the 3-D complexity is valid up to p < n 3 . The 
multi-partition schemes are described in the next section. The pipelined methods use the Thomas 
algorithm with 1-D, 2-D and 3-D partitioning. The arithmetic operations count gives the total 
parallel computation cost, including the dependency delays. The column ‘words copied’ gives the 
amount of data that must copied into the memory because of the partitioning scheme used. The 
number of messages is the minimum number of messages that must be encountered in a sequence 
from the beginning to the end of the computation. This sequence of messages represents minimum 
communication delay in that partitioning scheme because of the message overhead. This number 
takes into account all the messages sent by all the processors in parallel. The message size is in 
terms of the number of words per message. For Thomas based algorithms, two words must be sent 
per line in the forward sweep and one word in the back sweep. The message size for the forward 
sweep could be increased to 2 words, reducing the number of messages sent in the forward sweep 
by a factor of 2. The substructure algorithm uses 1-D partitioning, with the Thomas algorithm in 
the two directions which need no communication and the substructure algorithm in the direction 
which crosses processor boundaries. 

There are two major drawbacks of the uni-partitioning schemes described above. First, load 
balancing is difficult. In practice, the problems to be solved may have grid sizes that cannot be 
evenly divided among the processors. Thus, if the granularity of the load distribution is to be 
retained at grid level, then some partitions may end up with a higher number of grid points to 
compute than the others. Since the computations per grid point are significant, the effect of this 
load imbalance may be considerable. This is especially true when the number of processors is 
large and the number of grid points assigned to a processor is relatively small. For the partitioning 
schemes described above, the problem is compounded by the fact that the load imbalance may be of 


14 




arithmetic 

op’s 


number of 
messages 

message 

size 

Uni-partition 





Pipelined: 

1-D 

24^(1 + &) 

- 

3(n 2 + p ) 

i 

2-D 

24^(1 + £r) 

- 

6(1 + £)£ 

i 

3-D 

24^(1 +£) 

- 

9(1 + 5 )^ 

i 

Substructure 

33^ 

20n 2 

21ogp 

10n 2 

Multi-partition 





Transpose 

24— 

p 

8^-logp 

21ogp 

to 

10 1 3 - 

Diagonal: 

2-D 

to 

- 

6» 2 

1 

3-D 

24^ 

- 

9n 2 

1 


Table 2: Complexities of scalar tridiagonals in three dimensions; n 3 points, p processors. 


the order of lines of grid points or even planes of grid points. This adversely affects the performance 
irrespective of whether only a small number of processors or a large number of processors are used. 
If non-standard partitions are used, then the load may be better balanced, but the communication 
and programming costs may turn out to be unreasonable. The second drawback applies even when 
the grid can be evenly divided among the processors. This is because of the uneven distribution 
of the computational work at the grid points. The interior points which form the major bulk 
of the computation have higher amounts of computational work associated with them than the 
grid points on the boundary. With this unevenness, any grid-level partitioning scheme results in 
a load imbalance. Furthermore, the data dependencies of BC, RHS, and IMPLICIT are totally 
different and BC applies only to the boundary points, RHS is applied over all the grid points, 
and IMPLICIT is applied only to the interior points. This adds to the data dependence delays 
even if the computational work is perfectly divided among the processors, since the processors with 
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partitions that include the boundary grid points are required to work on BC and RHS while those 
with only the interior points are not. One way to reduce the detrimental effects of load imbalance 
is to assign multiple partitions to each processor. This is described in the next section. 


8.2 Multi-partition Scheme 


In a multi-partition scheme, the computational domain is divided into a number of sub-partitions 
that is larger than the number of processors and each processor is assigned more than one such sub- 
partition. An obvious advantage is that the granularity of each sub-partition may be made much 
smaller than that in the uni-partition scheme. This gives better control over the computational 
load distribution. Moreover, as we describe next, not only is it possible to better balance the overall 
computational work, but also the load may be balanced at the task level so that the additional 
data dependency delays are not introduced. In the following, we describe one such scheme that 
we refer to as the diagonal partitioning scheme. A partitioning scheme similar to the 2-D version 
presented below is described by Johnsson et al. for implementing ADI methods on multiprocessor 
systems [14]. 



2-D diagonal 
partitioning 


3-D diagonal 
partitioning 


Figure 5: Multi-partition schemes. 


In the diagonal partitioning scheme, the domain is partitioned in a manner similar to that of 
the 2-D or the 3-D uni-partitioning scheme described above. We refer to these as the 2-D diagonal 
and 3-D diagonal partitioning schemes, respectively. In the former case, the domain is subdivided 
into p 2 sub- partitions and in the later case it is subdivided into p 3 / 2 sub-partitions. In the 2-D 
diagonal partitioning scheme, each processor is assigned one sub-partition from each row and from 
each column of sub-partitions. Similarly in the 3-D diagonal partitioning scheme, each processor is 
assigned one sub-partition from each plane of sub-partitions. This arrangement results in assigning 
sub-partitions to a processor that form diagonal chains in two or three dimensions. Note that, for 
the 2-D diagonal partitioning scheme to be applicable, the computational grid should have at least 
p grid points on each side of one of the 2-D planes and for the 3-D diagonal partitioning case the 
computational grid should have at least p 1 / 2 grid points along each of the three dimensions. 

The above described diagonal partitioning and assignment schemes require each processor to 
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compute all the three tasks: BC, RHS, and IMPLICIT. Because each processor is assigned one sub- 
partition from each row and column of sub-partitions, the computation is much better balanced 
even when the number of grid points along each dimension of the grid is not evenly divisible by p 
or p 1 / 2 . An interesting aspect of the 2-D diagonal partitioning scheme is that any processor may 
need to communicate with only two other processors even though there are four sub-partitions 
surrounding each sub-partition. In the 3-D case, each processor needs to communicate with six 
other neighbors. 

The diagonal schemes, however, have significantly larger surface area that is adjacent to other 
sub-partitions. This results in higher communication requirements. The communication and com- 
putation requirements in the scalar tridiagonal case for the 2-D diagonal partitioning scheme are 
shown in Table 2. 

If the communication overhead is not substantial then the diagonal partitioning schemes have a 
clear advantage over the 2-D and 3-D uni-partitioning schemes described earlier. This can be seen 
by observing the maximum achievable speedups. For example, with the 3-D diagonal partitioning 
scheme applied to an n x tl X u grid and using 7i 2 processors, the maximum achievable speedup is 
n 2 . The same with the 2-D uni-partitioning scheme is 3n 2 /5 and with 3-D uni-partitioning scheme 
it is 7i 2 /2. 

Another variation of multi-partitioning scheme is the case where the domain is partitioned into 
p 2 sub- partitions as in the case of 2-D uni-partitioning scheme, but the partitions are switched 
during the computation of IMPLICIT. Here, each processor is initially assigned p sub-partitions 
all from the same row of sub-partitions. After completing the IMPLICIT in two of the directions 
that lie in the plane of sub-partitions assigned to the processor, the processors transpose the sub- 
partitions before working on the final dimension of IMPLICIT. With this scheme, there is no 
communication during the IMPLICIT computation except for the transpose. The advantage is 
that there are no data dependency delays. The costs associated with this scheme for the scalar 
tri-diagonal case are shown in Table 2. 

9 Implementation on Victor 

We now describe some of the results from the implementations on the Victor multiprocessor system 
developed at T. J. Watson Research Center, Yorktown Heights. Victor is a message passing MIMD 
architecture with Inmos T800 transputer as the processing unit. Each node is associated with 4 
MB DRAM and is connected to four other nodes forming a 2-dimensional grid interconnection. 
Victor is a modular architecture that can be configured from 16 to 256 nodes (Wilcke et. al.,[28]). 
All the implementations were made using the Inmos 3L parallel Fortran environment. The 3L— 
supported thread routines were used to mimic an asynchronous communication environment for 
overcoming some of the difficulties posed by the synchronous communication supported by the 
transputer hardware. 

For all the results presented here, we use the flow past a semi-infinite hemisphere-cylinder 
body as a model problem. The computations were performed on a grid with 30 points in the 
axial direction, 12 points circumferentially and 30 points in the normal direction. Figures 6 and 7 
show the body and the associated grid in the curvilinear coordinates. All the computations were 
performed with double precision arithmetic. 

We have considered three different partitioning schemes for both the block tri-diagonal and the 
scalar penta-diagonal versions. These are the 1-D and 2-D uni-partitioning schemes, and the 2-D 
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Figure 6: The hemisphere-cylinder body. 


diagonal partitioning scheme. In all the cases the partitioning of the computational domain was 
along the axial and the normal directions only. 



(1.1) 

(2,2) 

(3,3) 

(M) 

BC 

.988 

.501 

.385 

.317 

RHS 

30.1 

8.42 

4.26 

2.44 

IMPLICIT 

199. 

51.4 

26.4 

13.2 

TOTAL 

230. 

60.3 

31.0 

15.9 


Table 3: Timings for the block tridiagonal version, using 1, 4, 9 and 16 processors (in seconds per 
time step). 

The results from the implementations of the scalar penta-diagonal and block tri-diagonal ver- 
sions of ARC-3D using the 2-D uni-partitioning schemes are compared in Tables 3 through 6. In 
all the cases the timings are given for each of the three tasks for four different processor config- 
urations. These are a 1 processor (1,1), 4 processors (2,2), 9 processors (3,3), and 16 processors 
(4,4). In Tables 3 and 4 actual wall clock timings per time step are given. The speedups over one 
processor timings are given in Tables 5 and 6. It can seen that, though the block tridiagonal version 
is more expensive than the penta-diagonal version, it is more efficient in extracting the available 
parallelism. 

An interesting observation is the comparison of the performance of 9 processors with that of 


18 



Figure 7: A cross-section of the grid 


in curvilinear coordinates. 



(1.1) 

(2,2) 

(3,3) 

(4,4) 

BC 

.988 

.560 

.426 

.438 

RHS 

30.2 

8.52 

4.36 

2.57 

IMPLICIT 

64.6 

17.1 

8.93 

4.49 

TOTAL 

95.8 

26.2 

13.7 

7.50 


Table 4: Timings for the penta-diagonal version, using 1, 4, 9 and 16 processors (in seconds per 
time step). 

16 processors. For the problem size considered here and for the 2-D uni-partitioning schemes, the 
interior points are evenly divisible among the 16 processors, but are not divisible equally among 9 
processors. The effect of this load imbalance is clear from the Tables 5 and 6. 

The relative performance of the three tasks BC, RHS, and IMPLICIT are also to be noted. 
As expected, BC performs poorly, because most of the computation in BC is along the solid body 
which is computed by only 4 processors in the case of 16 processor implementation and by only 3 
processors in the case of 9 processor implementation. RHS, although easily parallelizable, performs 
poorly as compared to IMPLICIT mainly because of the load imbalance. The partitions divide 
the interior points evenly, but this is not true when the boundary points are also included. In 
RHS, computations are performed at the interior as well as at the boundary points which affects 
its performance. IMPLICIT has only the interior points to compute and it performs very well. 
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Even here there is a slight load imbalance that is not obvious. The load imbalance is caused by 
the fact that the computations at the first point of a line solve are considerably less than those at 
the rest of the interior points. However, the sequential effect of the pipelining of IMPLICIT masks 
this imbalance. 

The diagonal multi-partitioning schemes are much harder to implement than any of the uni- 
partition schemes. The difficulty arises in efficiently managing the multiple sub-partitions assigned 
to each processor. Since the number of sub-partitions assigned to a processor increases as the total 
number of processors working on the problem increases, the overhead for large number of processors 
can be very high, especially in the 2-D diagonal case where the relative gains are small. 



(1.1) 

(2,2) 

(3,3) 

(M) 

BC 

1.00 

1.97 

2.57 

3.12 

RHS 

1.00 

3.57 

7.07 

12.3 

IMPLICIT 

1.00 

3.87 

7.54 

15.1 

TOTAL 

1.00 

3.81 

7.42 

14.5 


Table 5: Speedups for the block tridiagonal version, using 1, 4, 9 and 16 processors. 

Currently, we are in the process of completing this implementation and the results seen so far 
are encouraging. For the 2-D diagonal version, the assignment scheme results in each processor 
having to communicate with only two other processors. These neighboring processors remain 
the same throughout the entire computation. Thus, if the processors can be arranged in a loop, 
the 2-D diagonal version can be mapped onto these processors which results in nearest neighbor 
communication for BC, RHS, and IMPLICIT. Mapping a loop onto a mesh is relatively easy. 
However, to retain nearest neighbor communication only the loops with even number of processors 
are permissible. 



(M) 

(2,2) 

(3,3) 

(4,4) 

BC 

1.00 

1.76 

2.32 

2.26 

RHS 

1.00 

3.54 

6.93 

11.8 

IMPLICIT 

1.00 

3.78 

7.23 

14.4 

TOTAL 

1.00 

3.66 

6.99 

12.8 


Table 6: Speedups for the penta-diagonal version, using 1, 4, 9 and 16 processors. 


10 Conclusions 

We have attempted to show, by analysis and experimentation, the extent to which CFD applications 
based on implicit scheme can be parallelized. A specific example of an implicit scheme involving 
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line solves in all three coordinate directions was considered. It is observed that the affect of the 
various data dependencies of the different parts of the algorithm adversely affect the parallelism, 
but these can be minimized with an appropriate partitioning strategy. 
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